MLOps Lifecycle & Production Monitoring Guide - Healthcare Client Focus
Healthcare Actuarial Models — Azure Databricks
Assumed Stack: Azure Databricks · Unity Catalog · MLflow · Delta Lake · dbt · Azure DevOps / Git · PySpark / Python / SQL
Inference Pattern: Batch scoring (no real-time endpoints)
Version: 1.0 — February 2026
1. Foundational Principles
The Databricks MLOps guidance (often referred to as the “Big Book of MLOps”) establishes a maturity model that moves teams from ad-hoc notebook experimentation toward fully governed, automated model lifecycles. For healthcare actuarial models — where predictions drive care decisions and financial projections — we target MLOps Maturity Level 3: automated pipelines, governed model promotion, continuous monitoring, and auditable lineage.
Five principles anchor every decision in this guide:
Models are code. Model training logic, feature engineering, and scoring pipelines live in version-controlled repositories in Azure DevOps, not in manually-edited notebooks.
Data is a first-class artifact. Every model depends on the quality and stability of its input data. dbt transformations and Delta Lake versioning give us reproducibility from raw claims through scored output.
Environments are isolated. Development, Staging, and Production are separate Databricks workspaces (or at minimum, separate catalogs in Unity Catalog). Code promotes across environments; data does not leak between them.
Promotion is gated, not automatic. No model reaches Production without passing validation checks, statistical comparison to the current champion, and human approval from the Actuarial team.
Monitoring is not optional. A model without monitoring is a liability. Every model in Production has drift detection, performance tracking, and an alerting contract.
2. Roles & Responsibilities Matrix
The MLOps lifecycle is not a single team’s job. It requires coordinated handoffs between roles, each owning a specific slice of the pipeline. In a healthcare context, we add Actuarial SMEs and Compliance as first-class participants — they are not consulted at the end, they are embedded throughout.
2.1 Role Definitions
Data Engineer (DE)
Owns the data platform. Responsible for ingesting raw data (claims, eligibility, pharmacy, lab, EMR), building and maintaining dbt models that produce Bronze → Silver → Gold layers in Delta Lake, and ensuring data quality and freshness SLAs. The DE does not build ML models but is accountable for the data those models consume. In practice, the DE is the first person paged when a model monitoring alert fires, because the root cause is data more often than it is model logic.
Data Scientist (DS)
Owns the model. Responsible for exploratory analysis, feature engineering, model training, hyperparameter tuning, and validation. The DS works in the Development environment, using MLflow to track experiments. They write the training code that will eventually run as an automated pipeline in Production. The DS also owns the statistical validation that a retrained challenger model is equivalent to or better than the current champion. For actuarial models, the DS works hand-in-hand with the Actuarial SME to ensure clinical and financial validity.
MLOps / ML Engineer (MLE)
Owns the production pipeline. Responsible for converting the DS’s training notebooks into production-grade, parameterized jobs; building the scoring pipeline; implementing CI/CD in Azure DevOps; configuring monitoring and alerting; and operating the champion/challenger promotion workflow. The MLE is the bridge between “it works in my notebook” and “it runs reliably at 2 AM every Sunday.” They own the MLflow Model Registry lifecycle (None → Staging → Production → Archived) and the Databricks Workflows that orchestrate scoring.
Infrastructure / Platform Engineer (Infra)
Owns the Databricks platform, Azure networking, compute provisioning, and infrastructure-as-code (Terraform or Bicep for Azure resources, Databricks Asset Bundles for workspace objects). Responsible for cluster policies, instance pools, Unity Catalog metastore configuration, Azure Key Vault integration, and Private Link / VNet injection. They do not touch model code but ensure the platform is secure, cost-efficient, and reliable.
Security & Compliance (Sec)
Owns access control, PHI protection, and regulatory compliance. Responsible for Unity Catalog permissions (table-level, column-level masking for PHI), Azure AD group mappings, audit log configuration, and ensuring the platform meets HIPAA, state DOI, and CMS requirements. In healthcare, this role also reviews model monitoring dashboards for any inadvertent PHI exposure and validates that audit trails are complete.
Actuarial SME / Business Owner (ACT)
Owns the business logic and acceptance criteria. Responsible for defining model requirements, reviewing validation reports, approving model promotions to Production, and interpreting monitoring results in a clinical and financial context. The Actuarial team is the final approval gate — no model goes live without their sign-off. They also define the performance thresholds that trigger retraining alerts (e.g., “if the predictive ratio deviates more than 5% from 1.0, we need to investigate”).
2.2 RACI by Lifecycle Phase
| Phase | DE | DS | MLE | Infra | Sec | ACT |
|---|---|---|---|---|---|---|
| Data Ingestion & dbt Pipeline | A/R | C | C | C | C | I |
| Feature Engineering | C | A/R | C | I | C | C |
| Model Training & Experimentation | I | A/R | C | I | I | C |
| Model Validation & Testing | C | A/R | R | I | I | A |
| CI/CD Pipeline & Automation | C | C | A/R | C | C | I |
| Model Registry & Promotion | I | R | A/R | I | C | A |
| Scoring Pipeline (Production) | C | C | A/R | C | C | I |
| Monitoring & Drift Detection | C | R | A/R | C | I | C |
| Alerting & Incident Response | R | R | A/R | C | I | C |
| Retraining Decision | I | R | R | I | I | A |
| Security & Access Control | C | I | C | R | A/R | I |
| Regulatory Audit Support | C | C | C | C | A/R | R |
A = Accountable (final decision), R = Responsible (does the work), C = Consulted, I = Informed
3. The MLOps Lifecycle — End to End
This section walks through the complete lifecycle as it applies to the four distinct actuarial models being migrated into Databricks. Each phase is annotated with who does the work and what artifacts are produced.
3.1 Data Ingestion & Feature Store
Owner: Data Engineer
Artifacts: dbt models, Delta tables in Unity Catalog, data quality reports
Raw data lands in ADLS Gen2 from upstream systems (claims adjudication, eligibility, pharmacy benefits, lab feeds, EMR extracts, vendor files). The DE builds dbt models on Databricks that transform raw data through the medallion architecture:
Bronze: Raw ingestion, append-only, schema-on-read. Minimal transformation — just land it.
Silver: Cleaned, deduplicated, conformed. Claims are adjudicated, eligibility spans are resolved, member keys are unified. dbt tests enforce referential integrity and not-null constraints.
Gold: Model-ready feature tables. Aggregations at the member-month or member-quarter grain. Lookback windows applied (e.g., 12-month rolling claims for Risk Stratification, 6-month diagnosis history for Palliative Care). These Gold tables are the contract between DE and DS.
The DE registers Gold tables in Unity Catalog under a features schema. For models that share features (Risk Stratification and Concurrent Risk Score likely share claims-based features), the DE publishes shared feature tables to avoid duplication.
Data quality is enforced at every layer using dbt tests (unique, not_null, accepted_values, relationships) and optionally Great Expectations for statistical checks. Quality results are written to a data_quality.test_results Delta table for monitoring.
3.2 Experimentation & Model Development
Owner: Data Scientist
Artifacts: MLflow experiments, training notebooks, feature importance analysis, validation notebooks
The DS works in the appropriate Databricks workspace (or catalog). They read from feature tables, explore data, engineer additional features, and train models. All experiments are logged to MLflow Tracking:
Parameters (hyperparameters, feature sets, lookback windows)
Metrics (AUC, precision, recall, calibration, Brier score, predictive ratio, R² on cost)
Artifacts (model binaries, feature importance plots, calibration curves, SHAP summaries)
Input data signature and example input (for schema enforcement downstream)
For the actuarial models, the DS pays special attention to:
Risk Stratification: Calibration is as important as discrimination. The model must not just rank members correctly but produce well-calibrated probability estimates that translate to expected cost.
Palliative Care: Sensitivity at high specificity — false negatives (missing a patient who would benefit from palliative care) are more costly than false positives. The DS should log precision-recall curves at multiple thresholds.
Concurrent Risk Score: This is a proprietary scoring methodology. The DS must document the algorithm specification thoroughly because this model may have regulatory or contractual IP implications. All coefficients, weights, and business rules must be version-controlled.
Impact Model: TBD
The DS does not deploy models. When satisfied with a candidate, they register it in the MLflow Model Registry with a None or Staging stage and create a pull request in Azure DevOps that triggers the CI/CD pipeline.
3.3 CI/CD Pipeline & Automated Validation
Owner: MLOps Engineer
Artifacts: Azure DevOps pipeline YAML, test reports, validation notebooks, promotion records
The MLE builds and maintains the CI/CD pipeline in Azure DevOps. The pipeline is triggered by a pull request to the main branch (or a release/* branch) and executes the following stages:
CI Stage (triggered on PR):
Lint and static analysis on all Python/PySpark code (ruff, mypy)
Unit tests for feature engineering functions and scoring logic
Integration test: run the training pipeline on a small sample dataset in the Staging workspace
Validate the resulting model against the current Production champion:
- Load the champion model from MLflow Registry (Production stage)
- Load the challenger model from the CI run
- Score both on a held-out validation dataset
- Compare metrics (AUC, calibration, predictive ratio, etc.)
- Compute Population Stability Index (PSI) between champion and challenger score distributions
- Generate a validation report artifact
CD Stage (triggered on merge to main, after approval):
Deploy training pipeline as a Databricks Workflow in the Production workspace
Deploy scoring pipeline as a Databricks Workflow
Deploy monitoring notebooks and alerting configuration
Register the model in the Production MLflow Registry as Staging
Do not promote to Production stage automatically — this requires Actuarial approval (see Section 3.4)
The pipeline YAML templates are designed to be reusable across all four models. Model-specific configuration (feature table paths, metric thresholds, scoring schedule) is parameterized in a model_config.yml file per model.
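A hedged sketch of what such a per-model model_config.yml might contain — the key names, catalog paths, and threshold values below are illustrative assumptions, not a fixed schema:

```yaml
# Illustrative per-model configuration. All key names, paths, and values
# here are examples, not the client's actual schema.
model_name: risk_stratification
feature_table: prod.features.member_month_gold    # Unity Catalog path (example)
scoring_schedule: monthly                          # weekly for Palliative Care
lookback_months: 12
validation:
  min_auc: 0.78
  max_psi: 0.25
  predictive_ratio_range: [0.90, 1.10]
alerting:
  teams_channel: ml-monitoring        # hypothetical channel name
  pagerduty_service: ml-scoring       # hypothetical service name
```

The CI/CD templates read this file at runtime, so adding a fifth model means adding a config file rather than a new pipeline.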
3.4 Model Promotion & Governance
Owner: MLOps Engineer (executes) · Actuarial SME (approves)
Artifacts: Promotion request, validation report, approval record, model card
Model promotion follows a strict governance workflow:
MLE generates a Promotion Request containing: validation report, metric comparison, drift analysis, data lineage, and a model card.
DS reviews the statistical validity and signs off.
ACT reviews the business validity — do the scores make clinical and financial sense? Are there any subpopulation biases? Is the model aligned with the actuarial filing?
Sec confirms that no new PHI columns were introduced and access controls are correct.
MLE transitions the model in MLflow Registry from Staging to Production and archives the previous champion.
This promotion is logged as an auditable event. In Unity Catalog and MLflow, the full lineage is preserved: which data version trained the model, which code version produced it, who approved it, and when.
3.5 Batch Scoring in Production
Owner: MLOps Engineer
Artifacts: Scored Delta tables, scoring run logs, output delivery confirmations
Batch scoring is the steady-state operation for all four models. The MLE owns the Databricks Workflows that orchestrate scoring:
Trigger: Scheduled (e.g., monthly for Risk Stratification, weekly for Palliative Care) or event-driven (new claims data loaded).
Load model: The scoring notebook loads the Production stage model from MLflow Registry using mlflow.pyfunc.load_model("models:/<model_name>/Production").
Load features: Read from the Gold feature tables. Validate row counts and data freshness before scoring.
Score: Apply the model to produce predictions. For batch, this is a model.predict() call over a Spark DataFrame, leveraging mlflow.pyfunc.spark_udf() for distributed scoring.
Post-process: Apply business rules, score banding, exclusion logic, and output formatting.
Write output: Scored results written to a scored_output Delta table in the Gold layer, partitioned by scoring date.
Deliver: Outputs pushed to downstream systems (care management platform, actuarial reporting, Power BI datasets) via ADF, JDBC, or file export.
Log: Scoring metadata (model version, input row count, output row count, min/max/mean score, runtime, cluster ID) written to a scoring_runs Delta table.
Failure handling: If any step fails, the Workflow retries once, then alerts the MLE via PagerDuty/Teams. The scoring table is not updated with partial results — it is all-or-nothing per run.
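The score-banding part of the post-processing step can be sketched in plain Python. The cut-points below are illustrative placeholders — in practice the Actuarial team sets the tier thresholds per model:

```python
def assign_risk_tier(score: float,
                     high_cut: float = 0.80,
                     medium_cut: float = 0.50) -> str:
    """Map a model score to a risk tier.
    Cut-points are illustrative placeholders -- real thresholds
    are defined by the Actuarial team in the model's config."""
    if score >= high_cut:
        return "High"
    if score >= medium_cut:
        return "Medium"
    return "Low"
```

In the Spark scoring pipeline, a function like this would be wrapped in a UDF and applied to the scored DataFrame before writing to scored_output.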
4. Production Model Monitoring — Deep Dive
This is the most critical section of this guide. A model in Production without monitoring is a model accumulating silent risk. For healthcare actuarial models, silent degradation can mean misidentified high-risk members, missed palliative care candidates, or inaccurate financial projections.
Monitoring for batch prediction models differs fundamentally from real-time endpoint monitoring. There is no request latency to track, no throughput to measure, no endpoint availability to check. Instead, we monitor the statistical properties of predictions and inputs over time, and we do so on a per-scoring-run cadence.
4.1 What We Monitor (The Four Pillars)
| Pillar | What It Detects | Urgency | Primary Owner |
|---|---|---|---|
| Data Quality & Pipeline Health | Missing data, schema changes, volume anomalies, stale features, dbt test failures | Immediate — blocks scoring | Data Engineer |
| Feature Drift (Input Drift) | Shifts in the statistical distribution of model input features between training and scoring | Days to weeks — early warning | Data Scientist + MLE |
| Prediction Drift (Output Drift) | Shifts in the distribution of model predictions (scores) over time | Days to weeks — early warning | MLE + Actuarial |
| Performance Degradation (Concept Drift) | Decline in model accuracy/calibration when ground truth becomes available | Weeks to months — lagged but highest severity | Data Scientist + Actuarial |
These four pillars are layered intentionally. Data quality issues surface first (within hours of a bad data load). Feature drift surfaces next (within the current scoring cycle). Prediction drift surfaces alongside feature drift. Performance degradation surfaces last, because ground truth in healthcare is delayed — you don’t know if a Risk Stratification prediction was correct until claims mature, which can take 3-12 months.
4.2 Monitoring Architecture
The monitoring system is itself a set of Databricks Workflows and Delta tables. It is not a separate platform — it lives in the same workspace as the models it monitors, which keeps lineage intact and avoids data export.

5. Observability Architecture
Observability goes beyond monitoring. Monitoring tells you something is wrong. Observability tells you why it is wrong and where to look. For batch scoring models, observability means being able to trace any individual prediction back through the pipeline to the raw data that produced it.
5.1 The Three Pillars of ML Observability
Logging
Every scoring run produces structured logs written to the scoring_runs Delta table:
| Column | Type | Example |
|---|---|---|
| run_id | string (UUID) | a1b2c3d4-... |
| model_name | string | risk_stratification |
| model_version | int | 7 |
| mlflow_run_id | string | mlflow-run-abc123 |
| score_date | date | 2026-02-15 |
| input_row_count | long | 1,245,000 |
| output_row_count | long | 1,245,000 |
| score_mean | double | 0.342 |
| score_median | double | 0.287 |
| score_std | double | 0.198 |
| score_min | double | 0.001 |
| score_max | double | 0.997 |
| score_p10 | double | 0.089 |
| score_p90 | double | 0.621 |
| null_prediction_count | long | 0 |
| feature_table_version | long | 42 (Delta version) |
| cluster_id | string | 0215-143022-abc |
| runtime_seconds | int | 847 |
| status | string | SUCCESS |
| error_message | string | null |
This table is the single source of truth for “what happened.” The MLE builds it; the DS, ACT, and Sec consume it.
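A minimal pure-Python sketch of assembling one scoring_runs row from a batch of scores. In production these statistics would be computed in Spark over the scored DataFrame; the function and field subset here are illustrative:

```python
import statistics
import uuid
from datetime import date

def build_scoring_run_record(model_name: str, model_version: int,
                             scores: list[float]) -> dict:
    """Summarize one batch scoring run as a scoring_runs row.
    Only aggregate statistics are logged -- never member-level data,
    which keeps the log table PHI-free by construction."""
    s = sorted(scores)
    n = len(s)
    return {
        "run_id": str(uuid.uuid4()),
        "model_name": model_name,
        "model_version": model_version,
        "score_date": date.today().isoformat(),
        "output_row_count": n,
        "score_mean": statistics.fmean(s),
        "score_median": statistics.median(s),
        "score_std": statistics.stdev(s) if n > 1 else 0.0,
        "score_min": s[0],
        "score_max": s[-1],
        "score_p10": s[int(0.10 * (n - 1))],
        "score_p90": s[int(0.90 * (n - 1))],
        "status": "SUCCESS",
    }
```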
Metrics
Quantitative signals computed per scoring run and stored in the drift_metrics and performance_metrics Delta tables. Covered in detail in Sections 6 and 7.
Traces (Lineage)
Every prediction can be traced back:
Prediction → Scoring Run: The scored_output table includes run_id and model_version.
Scoring Run → Model: The run_id maps to an MLflow run, which records the model artifact, training data version, and code commit.
Model → Training Data: MLflow logs the Delta table version used for training via mlflow.log_input().
Training Data → Raw Source: dbt lineage graphs trace Gold features back through Silver and Bronze to raw ingestion.
This full lineage is critical for healthcare audit. When a regulator or auditor asks “why was this member scored as high-risk?”, you can trace the answer from the prediction all the way back to the claims data that drove it.
5.2 Observability by Role
| Role | What They Observe | Where They Look |
|---|---|---|
| DE | Data freshness, dbt test results, row counts, schema changes | dbt Cloud dashboard, data_quality.test_results table, ADF monitoring |
| DS | Feature distributions, model metrics, SHAP values, calibration | MLflow UI, monitoring dashboard, ad-hoc notebooks |
| MLE | Scoring run health, drift metrics, pipeline failures, alert history | Databricks Workflows UI, monitoring dashboard, PagerDuty |
| Infra | Cluster performance, job costs, workspace health, network issues | Azure Monitor, Databricks admin console, cost dashboards |
| Sec | Access logs, PHI access events, permission changes | Unity Catalog audit logs, Azure AD logs, SIEM |
| ACT | Score distributions, population shifts, business metric alignment | Monitoring dashboard (read-only), monthly model health report |
6. Drift Detection Framework
Drift detection is the early warning system. It does not tell you the model is wrong — it tells you the world the model was trained on may no longer match the world the model is scoring in. For healthcare, drift is common: member populations shift during open enrollment, coding practices change with ICD updates, pandemic-era utilization patterns normalize.
6.1 Feature Drift (Input Drift)
Feature drift measures whether the distribution of each input feature at scoring time has shifted from its distribution at training time. The reference distribution is computed once when the model is trained and stored as an artifact.
Method: Population Stability Index (PSI)
PSI is the standard metric for detecting distribution shifts in actuarial and credit risk modeling. It works for both continuous and categorical features.
PSI = Σ (Actual% - Expected%) × ln(Actual% / Expected%)
Where Actual% is the proportion of observations in each bin at scoring time, and Expected% is the proportion at training time. For continuous features, use 10 equal-frequency bins from the training distribution.
| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.10 | No significant shift | None |
| 0.10 – 0.25 | Moderate shift, investigate | DS reviews; logged to dashboard |
| > 0.25 | Significant shift | Alert triggered; DS + ACT investigate |
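A minimal PSI implementation matching the formula above, in pure Python for clarity. In the monitoring notebook the bin proportions would come from Spark aggregates; the epsilon guard is a standard practical choice, not part of the formula itself:

```python
import math

def psi(expected_props: list[float], actual_props: list[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned proportions.
    expected_props: bin proportions from the training (reference) data.
    actual_props:   bin proportions from the current scoring run.
    eps guards against empty bins, where ln(0) is undefined."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions yield PSI = 0; the 0.10 / 0.25 interpretation thresholds are the conventional ones used here.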
Which features to monitor: Not all features. The MLE and DS jointly select the top 15-20 features by importance (from SHAP or model-native feature importance). Monitoring all 200+ features creates noise. Focus on the ones that actually drive predictions.
Model-specific feature drift concerns for the four healthcare models:
Risk Stratification: Monitor diagnosis code density (count of unique HCCs), pharmacy cost features, and inpatient admission counts. These shift during open enrollment and after CMS HCC model updates.
Palliative Care: Monitor ADL (Activities of Daily Living) scores, hospitalization frequency, and diagnosis severity markers. These shift as the population ages or as clinical documentation practices change.
Concurrent Risk Score: Monitor the concurrent claims features closely — this model is sensitive to claims maturity lag. If the scoring pipeline runs before claims are fully adjudicated, feature distributions will appear to shift when the real issue is data completeness.
Impact Model: Monitor intervention enrollment features and comparison group characteristics. Selection bias in who receives interventions can create apparent drift.
Implementation — who does what:
DS identifies the features to monitor and computes the reference distributions during training. Stores them as a JSON artifact in MLflow.
MLE builds the monitoring notebook that loads the reference distributions, computes PSI for each feature on the current scoring run's input data, and writes results to drift_metrics.
MLE configures alerting thresholds.
DS investigates when alerts fire, determining whether the drift is real (population changed) or artifactual (data pipeline issue).
DE investigates if the DS suspects a data pipeline issue.
ACT is consulted if the drift is real, to determine whether retraining is needed.
6.2 Prediction Drift (Output Drift)
Prediction drift measures whether the distribution of model outputs (scores) has shifted from a baseline. The baseline can be either the training-time score distribution or a recent stable period.
Method: PSI on score distributions + summary statistic tracking
The MLE computes PSI on the score distribution (10 bins) and also tracks:
Mean score over time (trend detection)
Score decile boundaries over time (are thresholds shifting?)
Proportion of scores in each risk tier (for Risk Stratification: what % is High/Medium/Low?)
Prediction drift without feature drift is unusual and suggests a bug. Prediction drift with feature drift suggests a real population shift or concept drift. Feature drift without prediction drift means the model is robust to that particular shift — no action needed.
| Scenario | Feature Drift? | Prediction Drift? | Likely Cause | Action |
|---|---|---|---|---|
| A | No | No | Stable | None |
| B | Yes | No | Model is robust | Log and monitor |
| C | Yes | Yes | Population shift or concept drift | DS investigates; possible retrain |
| D | No | Yes | Bug in scoring pipeline | MLE investigates immediately |
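The decision matrix above reduces to a small lookup; a sketch (the return strings mirror the table and are illustrative, not a fixed API):

```python
def drift_scenario(feature_drift: bool, prediction_drift: bool) -> tuple[str, str]:
    """Return (likely cause, action) per the feature/prediction drift matrix."""
    if not feature_drift and not prediction_drift:
        return ("Stable", "None")                       # Scenario A
    if feature_drift and not prediction_drift:
        return ("Model is robust to the shift",
                "Log and monitor")                      # Scenario B
    if feature_drift and prediction_drift:
        return ("Population shift or concept drift",
                "DS investigates; possible retrain")    # Scenario C
    return ("Likely bug in scoring pipeline",
            "MLE investigates immediately")             # Scenario D
```

Encoding the matrix as code lets the drift monitoring job attach the recommended action directly to each alert it writes.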
6.3 Concept Drift
Concept drift means the relationship between features and the target has changed — the model’s learned patterns are no longer valid. This is detected through performance monitoring (Section 7), not through distribution checks. In healthcare, concept drift happens when:
CMS changes the HCC risk adjustment model
A new drug or treatment changes utilization patterns
A pandemic creates a temporary shift in healthcare behavior
State regulations change covered benefits
Concept drift is the most dangerous form of drift because it means the model is confidently wrong. It is also the hardest to detect quickly because it requires ground truth, which in healthcare is delayed.
7. Performance & Accuracy Monitoring
7.1 The Ground Truth Lag Problem
For healthcare models (especially actuarial models reliant on claims), ground truth does not arrive in real time. The delay depends on the model:
| Model | What “Ground Truth” Is | Typical Lag |
|---|---|---|
| Risk Stratification | Actual total cost of care for the member over the prediction period | 6-12 months (claims run-out) |
| Palliative Care | Whether the member was enrolled in palliative/hospice care, or died, within the prediction window | 3-6 months |
| Concurrent Risk Score | Actual concurrent period cost (after claims maturity) | 3-6 months (IBNR completion) |
| Impact Model | Measured ROI or clinical outcome of the intervention vs. comparison group | 6-12 months |
This lag means you cannot compute real-time accuracy. Instead, performance monitoring operates on a delayed, retrospective basis. The monitoring pipeline joins historical predictions with matured ground truth and computes performance metrics on a rolling basis.
7.2 Performance Metrics by Model
Risk Stratification:
| Metric | Description | Alert Threshold |
|---|---|---|
| AUC-ROC | Discrimination — can the model separate high-cost from low-cost members? | Drop > 0.03 from baseline |
| Predictive Ratio | Predicted cost / Actual cost, overall and by decile. Should be ~1.0. | Outside 0.90 – 1.10 |
| Calibration by Risk Tier | Mean predicted vs. actual cost for each risk tier (High/Med/Low) | Any tier off by > 15% |
| R² on Cost | Variance in actual cost explained by predicted cost | Drop > 0.05 from baseline |
| Decile Lift | Ratio of actual cost in top decile to average. Measures concentration. | Drop > 10% |
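A pure-Python sketch of the predictive ratio by decile. In production this runs in Spark over the joined predictions/ground-truth table; the function and input shape here are illustrative:

```python
def predictive_ratio_by_decile(pairs: list[tuple[float, float]]) -> list[float]:
    """pairs: (predicted_cost, actual_cost) per member.
    Returns predicted/actual ratio for each decile, ordered by
    predicted cost (decile 1 = lowest predicted cost).
    A well-calibrated model keeps every ratio near 1.0."""
    ranked = sorted(pairs, key=lambda p: p[0])
    n = len(ranked)
    ratios = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        pred = sum(p for p, _ in chunk)
        actual = sum(a for _, a in chunk)
        ratios.append(pred / actual if actual else float("nan"))
    return ratios
```

Per-decile ratios catch the common failure mode where the overall ratio is near 1.0 but the model over-predicts low-cost members and under-predicts high-cost ones.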
Palliative Care:
| Metric | Description | Alert Threshold |
|---|---|---|
| AUC-ROC | Discrimination for palliative care eligibility | Drop > 0.03 |
| Sensitivity at 90% Specificity | How many true positives are we catching at a fixed false positive rate? | Drop > 0.05 |
| Positive Predictive Value at Operating Threshold | Of those flagged, how many truly needed palliative care? | Drop > 10% |
| Calibration Curve | Predicted probability vs. observed rate, across deciles | Visual deviation |
| Brier Score | Overall calibration + discrimination combined | Increase > 0.02 |
Concurrent Risk Score:
| Metric | Description | Alert Threshold |
|---|---|---|
| Predictive Ratio | Predicted risk score / Actual concurrent cost, by score band | Outside 0.90 – 1.10 |
| R² on Cost | Explanatory power | Drop > 0.05 |
| Mean Absolute Error by Score Band | Accuracy within each score tier | Increase > 15% in any band |
| Population-Level Accuracy | Total predicted cost vs. total actual cost (aggregate calibration) | Off by > 3% |
Impact Model:
| Metric | Description | Alert Threshold |
|---|---|---|
| Estimated Treatment Effect Stability | Is the measured impact consistent over time? | Change > 20% from baseline |
| Covariate Balance | Are treatment and comparison groups still balanced on observables? | Standardized mean difference > 0.1 on any key covariate |
| Statistical Significance | Is the measured impact still statistically significant? | p-value crossing 0.05 |
| ROI Estimate Stability | Is the financial ROI estimate stable as more data matures? | Variance > 25% quarter-over-quarter |
7.3 Performance Monitoring Pipeline — Implementation
Who builds it: The MLE builds the pipeline infrastructure. The DS defines the metrics and validation logic. The ACT defines the alert thresholds.
The pipeline runs as a scheduled Databricks Workflow, typically monthly (aligned with claims maturity cycles):
Step 1: Identify scoring runs with matured ground truth
(e.g., predictions from 6+ months ago where claims have run out)
Step 2: Join predictions with ground truth outcomes
(scored_output JOIN claims_summary ON member_id, prediction_period)
Step 3: Compute performance metrics (AUC, calibration, predictive ratio, etc.)
Step 4: Write metrics to performance_metrics Delta table
with columns: model_name, metric_name, metric_value, evaluation_date,
prediction_date_range, ground_truth_as_of_date, model_version
Step 5: Compare current metrics to baseline (training-time metrics stored in MLflow)
Step 6: If any metric crosses alert threshold → trigger alert
Step 7: Update monitoring dashboard
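Step 3's metric computation can be illustrated with a dependency-free AUC in its rank-based (Mann-Whitney) form; a real pipeline would use a library implementation over the joined Delta tables, and the quadratic loop below is for clarity, not scale:

```python
def auc_roc(pairs: list[tuple[float, int]]) -> float:
    """pairs: (predicted_score, outcome) where outcome is 0 or 1.
    AUC = probability that a randomly chosen positive outranks a
    randomly chosen negative (ties count as half a win)."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The evaluation job would compare this value against the training-time baseline stored in MLflow and fire a P4 alert on a drop beyond the model's threshold.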
7.4 Subpopulation Monitoring
Aggregate metrics can mask subpopulation degradation. A model can have stable overall AUC while performing terribly for a specific subgroup. For healthcare actuarial models, this is both a performance issue and a fairness issue.
The DS and ACT jointly define subpopulations to monitor:
Age bands (pediatric, adult, 65+)
Line of business (Medicare, Medicaid, Commercial, ACA Exchange)
Chronic condition cohorts (diabetes, ESRD, behavioral health, oncology)
Geography (state, region, urban/rural)
New members vs. continuing members (new members have incomplete claims history)
Performance metrics are computed for each subpopulation. An alert fires if any subpopulation’s metric crosses its threshold, even if the overall metric is stable.
Who monitors subpopulations: The MLE builds the computation. The DS reviews the results. The ACT defines which subpopulations matter and what “fair” performance looks like. The Sec/Compliance team reviews for disparate impact.
8. Retraining Pipeline & Champion/Challenger
8.1 When to Retrain
Retraining is not scheduled on a fixed cadence by default. It is triggered by evidence:
| Trigger | Source | Who Decides |
|---|---|---|
| Feature drift PSI > 0.25 on multiple key features | Drift monitoring pipeline | DS recommends, ACT approves |
| Prediction drift PSI > 0.25 | Drift monitoring pipeline | DS recommends, ACT approves |
| Performance metric crosses alert threshold | Performance monitoring pipeline | DS recommends, ACT approves |
| External event (CMS HCC model update, ICD code revision, regulatory change) | Actuarial team / industry knowledge | ACT initiates, DS executes |
| Scheduled periodic refresh (if org policy requires it) | Calendar (e.g., annual for Risk Strat) | ACT mandates |
Note that in all cases, the Actuarial team has approval authority. This is a healthcare governance requirement — model changes can affect member care and financial projections.
8.2 The Retraining Workflow

8.3 Shadow Scoring
For high-risk models (Risk Stratification and Concurrent Risk Score directly affect financial projections), the MLE implements shadow scoring before full promotion:
The Production scoring pipeline continues using the champion model for official output.
A parallel pipeline scores the same input data with the challenger model and writes results to a shadow_scores Delta table.
The DS and ACT compare champion and challenger outputs side-by-side for 1-2 scoring cycles.
Only after shadow scoring confirms stability does the ACT approve promotion.
Shadow scoring adds cost (double compute) but dramatically reduces risk for models that drive actuarial filings or care management programs.
8.4 Automated Retraining Pipeline
The MLE builds the retraining pipeline as a parameterized Databricks Workflow:
Fetch latest training data from Gold feature tables (with a defined lookback window).
Train using the same code that was validated in CI/CD (pulled from Git, not copy-pasted).
Log everything to MLflow (parameters, metrics, artifacts, data version).
Register the new model in MLflow Registry as a new version.
Run automated validation against the current champion.
Generate validation report and notify the DS and ACT.
The pipeline does not auto-promote. It prepares everything for human review, assuming gated "human in the loop" governance and process controls.
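The automated validation step that precedes human review can be sketched as a simple gate. The metric names and tolerance below are illustrative; the real thresholds live in each model's config:

```python
def challenger_passes(champion: dict, challenger: dict,
                      tolerance: float = 0.01) -> bool:
    """Challenger is eligible for review only if it matches or beats
    the champion (within tolerance) on every higher-is-better metric.
    This gates the validation report -- promotion itself stays with ACT."""
    return all(
        challenger.get(metric, float("-inf")) >= value - tolerance
        for metric, value in champion.items()
    )
```

A challenger that fails the gate never generates a promotion request, which keeps the ACT review queue limited to candidates worth their time.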
9. Alerting, Escalation & Incident Response
9.1 Alert Tiers
| Tier | Severity | Example | Response Time | Who Is Notified |
|---|---|---|---|---|
| P1 — Critical | Scoring pipeline failed, no output produced | Workflow failure, OOM error, data table missing | < 1 hour | MLE (paged), DE, DS |
| P2 — High | Scoring succeeded but output is suspect | Null predictions > 0, row count mismatch, extreme score distribution shift | < 4 hours | MLE, DS, ACT |
| P3 — Medium | Drift detected, investigation needed | Feature PSI > 0.25, prediction drift detected | < 1 business day | DS, MLE, ACT (informed) |
| P4 — Low | Performance metric degraded (lagged) | AUC declined by 0.02, predictive ratio shifted to 0.92 | < 1 week | DS, ACT, MLE (informed) |
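The PSI threshold referenced in the P3 tier (alert at PSI > 0.25) is computed over binned distributions. A stdlib-only sketch follows; the epsilon guard and the rule-of-thumb thresholds in the docstring are conventional assumptions, not values mandated by this guide.

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions.

    expected / actual: per-bin proportions (each summing to ~1.0), e.g. the
    training-time feature distribution vs. the current scoring batch.
    Common rule of thumb: < 0.10 stable, 0.10-0.25 moderate shift, > 0.25 alert.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        psi += (a - e) * math.log(a / e)
    return psi
```

In the monitoring pipeline, `population_stability_index(train_bins, batch_bins) > 0.25` would set `alert_triggered` in `monitoring.drift_metrics` and raise a P3.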
9.2 Escalation Path
Alert fires
│
▼
MLE triages (is it infra, data, or model?)
│
├──▶ Infrastructure issue → Infra team (cluster, network, permissions)
│
├──▶ Data issue → DE triages (stale data, schema change, dbt failure)
│ │
│ ▼
│ DE fixes data pipeline → MLE re-runs scoring
│
└──▶ Model issue → DS investigates (drift, degradation, bug)
│
├──▶ Minor: DS documents, adjusts thresholds, monitors
│
└──▶ Major: DS recommends retraining → ACT approves → retrain workflow
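The escalation tree above amounts to a routing table from triaged cause to owner chain. A minimal sketch, with illustrative category names and owner lists:

```python
# Triage category -> ordered owner chain (categories and owners illustrative).
ESCALATION = {
    "infra": ["Infra team"],
    "data": ["DE", "MLE"],               # DE fixes pipeline, MLE re-runs scoring
    "model_minor": ["DS"],               # document, adjust thresholds, monitor
    "model_major": ["DS", "ACT", "MLE"], # retraining requires ACT approval
}

def route_alert(category: str) -> list:
    """Return the ordered owner chain for a triaged alert category."""
    if category not in ESCALATION:
        raise ValueError(f"unknown alert category: {category}")
    return ESCALATION[category]
```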
9.3 Incident Postmortem
Every P1 and P2 incident gets a postmortem within 5 business days. The MLE facilitates and documents:
Timeline of events
Root cause (5 Whys)
Impact (which downstream systems were affected, for how long)
Remediation (what was done to fix it)
Prevention (what changes prevent recurrence)
Postmortems are stored in the Central Repository (i.e., the Azure DevOps wiki) and linked to the relevant ADO work item. For healthcare models, postmortems also note whether any member care decisions were affected by the incident.
10. Healthcare-Specific Considerations
10.1 HIPAA and PHI in Monitoring
Monitoring dashboards and alert messages must not contain PHI. This means:
Drift metrics are computed on aggregate distributions, not individual members. Safe.
Performance metrics are computed on aggregates. Safe.
Scoring run logs contain summary statistics, not member-level data. Safe.
Alert messages reference run IDs and metric values, never member IDs. Safe.
Danger zone: Debugging a scoring failure may require inspecting individual records. This must happen in the Production workspace with appropriate access controls, and access must be logged via Unity Catalog audit.
Responsible role: Security & Compliance defines the rules. MLE builds the dashboards to comply. Infra configures audit logging.
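One way to enforce the "run IDs and metric values, never member IDs" rule is a guard at the point where alert payloads are built. This is a sketch; the blocked field names are an illustrative assumption and would come from the Security & Compliance team's actual PHI field inventory.

```python
# Field names that must never appear in alert payloads (illustrative list --
# the real inventory is owned by Security & Compliance).
PHI_FIELDS = {"member_id", "subscriber_id", "dob", "ssn", "name", "address"}

def build_alert_payload(run_id: str, tier: str, metrics: dict) -> dict:
    """Build an alert message from run-level identifiers and aggregates only."""
    leaked = PHI_FIELDS & {k.lower() for k in metrics}
    if leaked:
        raise ValueError(f"PHI-like fields in alert payload: {sorted(leaked)}")
    return {"run_id": run_id, "tier": tier, "metrics": metrics}
```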
10.2 Regulatory Model Validation
Some of these models may be subject to regulatory review (e.g., CMS for Medicare Advantage risk adjustment, state DOI for rate filings). The monitoring system must support:
Audit trail: Every model version, training dataset version, validation report, and promotion decision is preserved and retrievable.
Reproducibility: Given a model version and a data version, any historical scoring run can be reproduced exactly. Delta Lake time travel and MLflow artifact storage make this possible.
Documentation: Model cards, validation reports, and monitoring summaries must be exportable for regulatory submission.
Responsible role: Actuarial SME prepares regulatory documentation. DS provides technical content. MLE ensures the infrastructure supports reproducibility. Sec/Compliance reviews before submission.
10.3 Claims Maturity and IBNR
Models such as the Concurrent Risk Score are particularly sensitive to claims maturity. Incurred But Not Reported (IBNR) claims mean that recent claims data is incomplete. The monitoring pipeline must account for this:
Feature drift checks should compare against a training distribution that also had similar claims maturity (e.g., compare 3-month matured data against 3-month matured training data, not against fully matured training data).
Performance metrics should only be computed on fully matured periods (typically 6+ months of run-out).
The DE tags data with a claims_maturity_flag indicating completeness.
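The maturity-matched comparison described above can be sketched as a reference-selection helper. The structure is an assumption: it presumes the team stores one binned reference distribution per claims-maturity level, keyed by months of run-out.

```python
def select_reference(maturity_months: int, references: dict) -> dict:
    """Pick the training-time reference distribution whose claims maturity
    matches the current scoring batch (e.g., 3-month vs 3-month run-out).

    references: maps maturity-in-months -> binned feature distribution.
    """
    if maturity_months in references:
        return references[maturity_months]
    # Fall back to the closest available maturity -- never default to the
    # fully matured distribution, which would overstate drift on fresh data.
    closest = min(references, key=lambda m: abs(m - maturity_months))
    return references[closest]
```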
10.4 Annual HCC Model Updates
CMS updates the HCC risk adjustment model annually. When this happens:
ACT notifies the team of the update and its implications.
DE updates the HCC grouper logic in the dbt pipeline.
DS evaluates whether the Risk Stratification and Concurrent Risk Score models need retraining (they almost certainly do).
MLE coordinates the retraining and re-validation cycle.
ACT approves the updated models before the effective date.
This is a planned event, not a monitoring-triggered event. It should be in the team’s annual calendar.
11. Reference Architecture Diagram

Appendix A: Key Delta Tables for Monitoring
| Table | Schema | Owner | Write Cadence |
|---|---|---|---|
| monitoring.scoring_runs | run_id, model_name, model_version, score_date, input_rows, output_rows, score_stats, status, runtime | MLE | Every scoring run |
| monitoring.drift_metrics | model_name, feature_name, metric_type (PSI/KL/JS), metric_value, scoring_date, reference_date, alert_triggered | MLE | Every scoring run |
| monitoring.performance_metrics | model_name, metric_name, metric_value, evaluation_date, prediction_period, ground_truth_as_of, model_version | MLE + DS | Monthly (lagged) |
| monitoring.alert_history | alert_id, model_name, tier, description, triggered_at, resolved_at, resolved_by, root_cause | MLE | On alert |
| monitoring.model_promotions | model_name, from_version, to_version, promoted_by, approved_by, promotion_date, validation_report_path | MLE | On promotion |
| data_quality.test_results | test_name, table_name, status, tested_at, failure_details | DE (dbt) | Every dbt run |
This document is a living artifact. It should be reviewed quarterly by the MLOps Engineer, Data Scientist, and Actuarial SME, and updated as the platform matures, models evolve, and organizational practices change.